LILI: A Simple Language Independent Approach for Language Identification

Authors

  • Mohamed Al-Badrashiny
  • Mona T. Diab
Abstract

We introduce a generic, language-independent framework for linguistic code-switch point detection. The system uses word length, character-level n-gram language models (n = 1 to 5), and word-level unigram language models to train a conditional random fields (CRF) model that classifies input words into various languages. We test the proposed framework and compare it to state-of-the-art published systems on standard data sets from several language pairs: English-Spanish, Nepali-English, English-Hindi, Arabizi (Arabic written in the Latin/Roman script)-English, Arabic-Engari (English written in the Arabic script), Modern Standard Arabic (MSA)-Egyptian, Levantine-MSA, Gulf-MSA, a second English-Spanish set, and a second MSA-EGY set. The overall weighted average F-scores for these language pairs are 96.4%, 97.3%, 98.0%, 97.0%, 98.9%, 86.3%, 88.2%, 90.6%, 95.2%, and 85.0%, respectively. The results show that our approach, despite its simplicity, either outperforms or performs at comparable levels to state-of-the-art published systems.
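The per-word features named in the abstract (word length plus character-level scores for n = 1 to 5) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `CharNgramModel`, the helpers `build_models` and `token_features`, the boundary markers, the add-one smoothing, and the toy word lists are all assumptions made for illustration. The model below is a simplified frequency model over character n-grams rather than a true conditional n-gram language model, and the word-level unigram model and the CRF training step are omitted.

```python
import math
from collections import Counter


def char_ngrams(word, n):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = "^" + word + "$"  # boundary markers are an assumption of this sketch
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Add-one smoothed frequency model over character n-grams (hypothetical)."""

    def __init__(self, words, n):
        self.n = n
        self.counts = Counter(
            gram for w in words for gram in char_ngrams(w.lower(), n)
        )
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen n-grams

    def avg_log_prob(self, word):
        """Average smoothed log-probability of the word's n-grams."""
        grams = char_ngrams(word.lower(), self.n)
        if not grams:  # word too short to contain any n-gram of this order
            return 0.0
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def build_models(wordlists, orders=(1, 2, 3, 4, 5)):
    """Build one model per language and per n-gram order."""
    return {
        lang: {n: CharNgramModel(words, n) for n in orders}
        for lang, words in wordlists.items()
    }


def token_features(word, models):
    """Build a feature dictionary for one token: length plus per-language scores."""
    feats = {"length": len(word)}
    for lang, by_order in models.items():
        for n, model in by_order.items():
            feats[f"{lang}_char{n}"] = model.avg_log_prob(word)
    return feats


# Toy word lists, invented for illustration only.
wordlists = {
    "en": ["the", "house", "where", "think"],
    "es": ["el", "casa", "donde", "pienso"],
}
models = build_models(wordlists)
feats = token_features("casa", models)
print(feats["length"], feats["es_char3"] > feats["en_char3"])
```

In a full system, one such dictionary per token, collected sentence by sentence, would serve as input to a CRF toolkit that accepts per-token feature dictionaries, so that the sequence model can use neighbouring words when labelling each token.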


Similar resources

A Comparison of Spectral Methods for Spoken Language Identification

Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. A statistical model of these features is then obtained for each language. The ...


Language and Task Independent Text Categorization with Simple Language Models

We present a simple method for language independent and task independent text categorization learning, based on character-level n-gram language models. Our approach uses simple information theoretic principles and achieves effective performance across a variety of languages and tasks without requiring feature selection or extensive pre-processing. To demonstrate the language and task independen...


Text Classification using Language-independent Pre-processing

A number of language-independent text pre-processing techniques, to support multi-class single-label text classification, are described and compared. A simple but effective statistical keyword identification approach is proposed, coupled with a number of phrase identification mechanisms. Experimental results are presented.


On the Link between Identity Processing and Learning Styles among Young Language learners

The present study attempted to investigate the probable relationship between Iranian young language learners’ identity processing styles and their learning styles. To this end, 29 advanced learners (23 females and 6 males) were randomly selected from an English language institute. Twenty-nine advanced young language learners were chosen randomly out of whole advanced young language learners in t...


Error Taxonomy of TOEFL iBT Writing: An Iranian Perspective

TOEFL iBT has recently drawn attention to the impact language tests can have on language learning. Since error analysis-based instruction has gained new life with the advent of computer analysis of learner language, the researchers of this study examined a sample of integrated and independent writing tasks of 45 Iranian TOEFL iBT candidates in order to identify and classi...




Publication date: 2016